Aligning Bilingual Corpus: Especially for Language Pairs from Different Families
Abstract
Rather than using a length-based or translation-based criterion to align bilingual texts, this paper proposes a part-of-speech-based (POS-based) criterion. The postulate is that bilingual texts should share the same concepts, ideas, entities and events, and that these are usually expressed by a few critical POSes. Thus, within a bead, the numbers of critical POSes on the two language sides should be close. This criterion has two advantages: it behaves uniformly across different language families, and it is simpler than a translation-based criterion. Divide-and-conquer, dynamic programming and simulated annealing techniques are used to implement the POS-based alignment algorithm. Under the order constraint of alignment, this paper introduces a performance evaluation method for calculating precision and recall. The concept of incremental beads measures the degree of matching between the real bead sequence and the computed bead sequence. Two important issues are considered in the experiments. First, the test texts are in languages from different families, i.e., Chinese (an oriental language) and English (an occidental language). Second, they are selected from diverse registers, namely Sinorama magazine and an IBM user manual. The experimental results show that the simulated annealing approach performs very well. In aligning texts from Sinorama magazine with the help of paragraph markers, the recall is 94.4% and the precision is 94.9%. In aligning the IBM user manual without paragraph markers, the recall is 96.7% and the precision is 97.2%.
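The POS-based criterion described in the abstract can be illustrated with a minimal dynamic-programming sketch. It is not the paper's implementation: the set of critical tags (CRITICAL), the allowed bead shapes and their penalties (BEADS), and the cost function (absolute difference in critical-POS counts plus a shape penalty) are all assumptions chosen for illustration, and the divide-and-conquer and simulated annealing variants are not shown.

```python
from typing import List, Tuple

# Assumption: which POS tags count as "critical" is chosen here for illustration.
CRITICAL = {"NOUN", "VERB", "ADJ", "NUM"}

# Assumption: allowed bead shapes (source sentences, target sentences)
# mapped to a hypothetical penalty that favours 1-1 beads.
BEADS = {(1, 1): 0.0, (1, 2): 1.0, (2, 1): 1.0, (1, 0): 3.0, (0, 1): 3.0}

Bead = Tuple[Tuple[int, int], Tuple[int, int]]


def critical_count(tags: List[str]) -> int:
    """Number of critical-POS tokens in one tagged sentence."""
    return sum(1 for tag in tags if tag in CRITICAL)


def align(src: List[List[str]], tgt: List[List[str]]) -> List[Bead]:
    """Align two sequences of POS-tagged sentences, preserving order.

    Returns beads as ((src_start, src_end), (tgt_start, tgt_end)),
    using half-open sentence index ranges.
    """
    s = [critical_count(x) for x in src]
    t = [critical_count(x) for x in tgt]
    n, m = len(s), len(t)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward relaxation: every bead moves to a larger i and/or j,
    # so row-major order visits each state after all its predecessors.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                # Beads whose two sides have close critical-POS counts are cheap.
                diff = abs(sum(s[i:ni]) - sum(t[j:nj]))
                c = cost[i][j] + diff + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Trace the cheapest path back from the end of both texts.
    beads: List[Bead] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append(((pi, i), (pj, j)))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Given three source sentences with 3, 2 and 4 critical tokens and four target sentences with 3, 1, 1 and 4, this sketch pairs the second source sentence with the second and third target sentences, because their critical-POS counts match (2 versus 1 + 1). No translation lexicon is consulted at any point, which is the simplicity the abstract claims for the criterion.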
Similar resources
Phrase Alignment Based on Combination of Multiple Strategies
Phrase translation pairs are very useful for bilingual lexicography, machine translation systems, cross-lingual information retrieval and many other applications in natural language processing. The parse trees of sentences contain phrase boundary information. Linguistic knowledge in translation lexicon and semantic lexicon, and statistics results from bilingual corpus can be used to align Chinese wo...
Aligning Noisy Parallel Corpora Across Language Groups: Word Pair Feature Matching by Dynamic Time Warping
We propose a new algorithm, DK-vec, for aligning pairs of Asian/Indo-European noisy parallel texts without sentence boundaries. The algorithm uses frequency, position and recency information as features for pattern matching. Dynamic Time Warping is used as the matching technique between word pairs. This algorithm produces a small bilingual lexicon which provides anchor points for alignment.
Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features
Sentence-level alignment of bilingual parallel corpora plays a significant and indispensable role in machine translation, translation knowledge acquisition and bilingual lexicography research, and is fundamental work for natural language processing. Given the great deal of work in sentence alignment and the variety of methods that have been developed for bilingual terminology extraction, those are...
Cross Sentence Alignment for Structurally Dissimilar Corpus Based on Singular Value Decomposition
Extracting the alignment pairs is a critical step for constructing bilingual corpus knowledge base for Example Based Machine Translation Systems. Different methods have been proposed in aligning parallel corpus between two different languages. However, most of them focus on structurally similar languages like English-French. This paper presents a method of cross aligning Portuguese-Chinese bili...
Creating a Bilingual Ontology: A Corpus-Based Approach for Aligning WordNet and HowNet
The growing importance of multilingual information retrieval and machine translation has made multilingual ontologies an extremely valuable resource. Since the construction of an ontology from scratch is a very expensive and time consuming undertaking, it is attractive to explore ways of automatically aligning monolingual ontologies which already exist. This paper presents a language-independen...